Abstract

We propose a new algorithm for solving the graph-fused lasso (GFL), a method
for parameter estimation that operates under the assumption that the signal tends
to be locally constant over a predefined graph structure. Our key insight is to
decompose the graph into a set of trails which can then each be solved efficiently using
techniques for the ordinary (1D) fused lasso. We leverage these trails in a proximal
algorithm that alternates between closed-form primal updates and fast dual trail
updates. The resulting technique is both faster than previous GFL methods and more
flexible in the choice of loss function and graph structure. Furthermore, we present
two algorithms for constructing trail sets and show empirically that they offer a
tradeoff between preprocessing time and convergence rate.


1 Introduction 


Let $\beta = \{\beta_i : i \in \mathcal{V}\}$ be a signal whose components are associated with the vertices of
an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Our goal is to estimate $\beta$ on the basis of noisy
observations $y$. This paper proposes a new algorithm for solving the graph-fused lasso (GFL),
a method for estimating $\beta$ that operates under the assumption that the signal tends to
be locally constant over $\mathcal{G}$. The GFL is defined by a convex optimization problem that
penalizes the first differences of the signal across edges:

$$\underset{\beta \in \mathbb{R}^n}{\text{minimize}} \quad \ell(y, \beta) + \lambda \sum_{(r,s) \in \mathcal{E}} |\beta_r - \beta_s| , \qquad (1)$$

where $\ell$ is a smooth convex loss function and $n = |\mathcal{V}|$.
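As a concrete reading of (1), the objective can be evaluated directly from an edge list. The following is a minimal sketch, assuming the squared-error loss $\sum_i (y_i - \beta_i)^2$ and vertices labelled $0, \ldots, n-1$; the function name `gfl_objective` is hypothetical and is not taken from any released implementation.

```python
def gfl_objective(y, beta, edges, lam):
    """Evaluate sum_i (y_i - beta_i)^2 + lam * sum_{(r,s) in edges} |beta_r - beta_s|.

    Assumptions for illustration: squared-error loss, integer vertex labels,
    and `edges` given as an iterable of (r, s) pairs.
    """
    loss = sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    penalty = sum(abs(beta[r] - beta[s]) for r, s in edges)
    return loss + lam * penalty


# A three-vertex chain 0 - 1 - 2 with a single jump between vertices 1 and 2.
chain_edges = [(0, 1), (1, 2)]
value = gfl_objective([0.0, 0.0, 3.0], [0.0, 0.0, 2.0], chain_edges, lam=0.5)
```

Here the loss contributes 1 and the penalty contributes 0.5 x 2, so `value` equals 2.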

Many special cases of this problem have been studied in the literature. For example,
when $\ell$ is squared error loss and $\mathcal{G}$ is a one-dimensional chain graph, (1) is the
ordinary (1D) fused lasso [Tibshirani et al. 2005]. When $\mathcal{G}$ is a two- or three-dimensional
grid graph, (1) is often referred to as total-variation denoising [Rudin et al. 1992]. In
both cases, there exist very efficient algorithms for solving the problem. In the 1D case
(i.e. where $\mathcal{E}$ defines a chain graph), a dynamic programming routine due to Johnson
[2013] can recover the exact solution to (1) in linear time. And when the graph
corresponds to a 2D or 3D grid structure with distance-based edge weights and a Gaussian
loss, a parametric max-flow algorithm [Chambolle and Darbon 2009] can be used to
rapidly solve (1) to a desired precision level. Furthermore, when the loss is Gaussian
but no grid structure is present in the graph, the GFL is a special case of the
generalized lasso [Tibshirani and Taylor 2011].


However, none of these approaches apply to the case of an arbitrary smooth convex
loss function and a generic graph (which need not be linear or even have a geometric
interpretation). In this paper, we propose a method for solving (1) in this much more
general case. The key idea is to exploit a basic result in graph theory that allows any
graph to be decomposed into a set of trails. After forming such a decomposition, we
solve (1) using a proximal algorithm that alternates between a closed-form primal
update and a dual update that involves a set of independent 1-dimensional fused-lasso
subproblems, for which we can leverage the linear-time solver of Johnson [2013]. The
resulting algorithm is applicable to generic graphs, flexible in the choice of loss function
$\ell$, and fast. To our knowledge, no existing algorithm for the graph-fused lasso satisfies
all three criteria. A relevant analogy is with the paper by Ramdas and Tibshirani [2014],
who also exploit this 1D solver to derive a fast ADMM algorithm for trend filtering, a
problem that can also be posed on graphs [Wang et al. 2015].

We present two variations on the overall approach (corresponding to two different
algorithms for decomposing graphs into trails), and we investigate their performance
empirically. Our results reveal that these two variations offer a choice between quickly
generating a minimal set of trails at the expense of longer convergence times, versus
spending more time on careful trail selection in order to achieve more rapid
convergence. We further benchmark our approach against alternative approaches to GFL
minimization [Wahlberg et al. 2012; Chambolle and Darbon 2009; Tansey et al. 2014] and
show that our trail-based approach matches or outperforms the state of the art on a
collection of real and synthetic graphs. For the purposes of exposition, we use the typical
squared loss function $\ell(y, \beta) = \sum_i (y_i - \beta_i)^2$, though our methods extend trivially to
other losses involving, for example, Poisson or binomial sampling models.

The remainder of the paper is organized as follows. Section 2 presents our graph
decomposition approach and derives an ADMM optimization algorithm. Section 3 details
our two algorithms for discovering trails and demonstrates the effectiveness of good
trails with a simple illustrative example. Section 4 presents our benchmark experiments
across several graph types. Finally, Section 5 presents concluding remarks.
2 ADMM via Graph Decomposition


2.1 Background 

The core idea of our algorithm is to decompose a graph into a set of trails. We therefore
begin by reviewing some basic definitions and results on trails in graph theory. Let
$\mathcal{G} = (\mathcal{V}, \mathcal{E})$ be a graph with vertices $\mathcal{V}$ and edge set $\mathcal{E}$. We note two preliminaries. First,
every graph has an even number of odd-degree vertices [West 2001]. Second, if $\mathcal{G}$ is not
connected, then the objective function is separable across the connected components of
$\mathcal{G}$, each of which can be solved independently. Therefore, for the rest of the paper we
assume that the edge set $\mathcal{E}$ forms a connected graph.

Definition 1. A walk is a sequence of vertices in which every pair of consecutive vertices
is joined by an edge.

Definition 2. A trail is a walk in which all the edges are distinct.

Definition 3. An Eulerian trail is a trail which visits every edge in a graph exactly once.

Definition 4. A tour (also called a circuit) is a trail that begins and ends at the same vertex.

Definition 5. An Eulerian tour is a circuit where all edges in the graph are visited exactly
once.

Theorem 1 (West [2001], Thm 1.2.33). The edges of a connected graph with exactly $2k$
odd-degree vertices can be partitioned into $k$ trails if $k > 0$. If $k = 0$, there is an Eulerian tour.
Furthermore, the minimum number of trails that can partition the graph is $\max(1, k)$.


Theorem 1 reassures us that any connected graph can be decomposed into a set of
trails $\mathcal{T}$ on which our optimization algorithm can operate. Specifically, it allows us to
re-write the penalty function as

$$\sum_{(r,s) \in \mathcal{E}} |\beta_r - \beta_s| = \sum_{t \in \mathcal{T}} \sum_{(r,s) \in t} |\beta_r - \beta_s| .$$

We first describe the algorithmic consequences of this representation, and then we turn
to the problem of how to actually perform the decomposition of $\mathcal{G}$ into trails.
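The trail count guaranteed by Theorem 1 depends only on the vertex degrees, so it can be computed before any decomposition is attempted. A minimal sketch follows, assuming a connected simple graph given as an edge list; the helper names `odd_degree_vertices` and `min_trail_count` are hypothetical.

```python
from collections import Counter


def odd_degree_vertices(edges):
    """Return the sorted list of vertices with odd degree."""
    degree = Counter()
    for r, s in edges:
        degree[r] += 1
        degree[s] += 1
    return sorted(v for v, d in degree.items() if d % 2 == 1)


def min_trail_count(edges):
    """Minimum number of trails partitioning a connected graph: max(1, k),
    where 2k is the number of odd-degree vertices (Theorem 1)."""
    k = len(odd_degree_vertices(edges)) // 2
    return max(1, k)


# A 4-cycle has no odd-degree vertices, so one Eulerian tour suffices.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
# A star with three leaves has four odd-degree vertices, so k = 2.
star = [(0, 1), (0, 2), (0, 3)]
counts = (min_trail_count(cycle), min_trail_count(star))
```

For these two graphs `counts` is `(1, 2)`.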


2.2 Optimization 

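Given a graph decomposed into a set of trails, $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$, we can rewrite (1) by
grouping the edges of each trail together:

$$\underset{\beta \in \mathbb{R}^n}{\text{minimize}} \quad \ell(y, \beta) + \lambda \sum_{t \in \mathcal{T}} \sum_{(r,s) \in t} |\beta_r - \beta_s| . \qquad (2)$$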


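For each trail $t$ (where $|t| = m$), we introduce $m + 1$ slack variables, one for each
vertex along the trail. If a vertex is visited more than once in a trail then we introduce
multiple slack variables for that vertex. We then re-write (2) as

$$\begin{aligned}
\underset{\beta \in \mathbb{R}^n,\, z}{\text{minimize}} \quad & \ell(y, \beta) + \lambda \sum_{t \in \mathcal{T}} \sum_{(r,s) \in t} |z_r - z_s| \\
\text{subject to} \quad & \beta_r = z_r , \quad \beta_s = z_s .
\end{aligned} \qquad (3)$$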


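We can then solve this problem efficiently via an ADMM routine [Boyd et al. 2011] with
the following updates:

$$\beta^{k+1} = \underset{\beta}{\text{argmin}} \left( \ell(y, \beta) + \frac{\alpha}{2} \| A\beta - z^k + u^k \|^2 \right) \qquad (4)$$

$$z_t^{k+1} = \underset{z}{\text{argmin}} \left( w \sum_{r \in t} (\tilde{y}_r - z_r)^2 + \lambda \sum_{(r,s) \in t} |z_r - z_s| \right), \quad t \in \mathcal{T} \qquad (5)$$

$$u^{k+1} = u^k + A\beta^{k+1} - z^{k+1} \qquad (6)$$

where $u$ is the scaled dual variable, $\alpha$ is the scalar penalty parameter, $w = \frac{\alpha}{2}$,
$\tilde{y}_r = \beta_r + u_r$ (the $r$-th entries of $A\beta^{k+1}$ and $u^k$),
and $A$ is a sparse binary matrix used to encode the appropriate $\beta_i$ for each $z_j$. Here we
use $t$ to denote both the vertices and edges along trail $t$, so that the first summation in
(5) is over the vertices while the second summation is over the edges.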

In the concrete case of squared-error loss, $\ell(y, \beta) = \sum_i (y_i - \beta_i)^2$, the $\beta$ updates have
a simple closed-form solution:

$$\beta_i^{k+1} = \frac{2 y_i + \alpha \sum_{j \in \mathcal{J}} (z_j - u_j)}{2 + \alpha |\mathcal{J}|} ,$$

where $\mathcal{J}$ is the set of dual variable indices that map to $\beta_i$. Crucially, our trail
decomposition approach means that each trail's $z$ update in (5) is a one-dimensional fused
lasso problem which can be solved in linear time via an efficient dynamic programming
routine [Johnson 2013].
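The two closed-form steps can be written in a few lines. The sketch below assumes the squared-error loss $\sum_i (y_i - \beta_i)^2$ and stores the binary matrix $A$ implicitly as a list `slack_vertex`, where `slack_vertex[j]` is the vertex copied by slack variable $z_j$; both that representation and the function names are assumptions made for illustration. The trail-wise $z$ update is omitted because it requires a 1D fused lasso solver.

```python
def beta_update(y, z, u, slack_vertex, alpha):
    """Closed-form beta-step: (2 y_i + alpha * sum_{j in J_i} (z_j - u_j)) / (2 + alpha |J_i|).

    J_i is the set of slack indices j with slack_vertex[j] == i.
    """
    n = len(y)
    sums = [0.0] * n    # running sum of (z_j - u_j) over J_i
    counts = [0] * n    # |J_i|
    for j, i in enumerate(slack_vertex):
        sums[i] += z[j] - u[j]
        counts[i] += 1
    return [(2.0 * y[i] + alpha * sums[i]) / (2.0 + alpha * counts[i])
            for i in range(n)]


def dual_update(beta, z, u, slack_vertex):
    """Scaled dual step: u_j <- u_j + beta_{i(j)} - z_j."""
    return [u[j] + beta[i] - z[j] for j, i in enumerate(slack_vertex)]


# Two vertices; vertex 1 is copied by two slack variables.
y = [1.0, 3.0]
slack_vertex = [0, 1, 1]
z = [1.0, 2.0, 4.0]
u = [0.0, 0.0, 1.0]
beta = beta_update(y, z, u, slack_vertex, alpha=2.0)
u_next = dual_update(beta, z, u, slack_vertex)
```

In this example vertex 1 receives $(2 \cdot 3 + 2 \cdot 5) / (2 + 2 \cdot 2) = 8/3$, which is the stationary point of the corresponding quadratic in (4).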

The combination of the closed-form updates in the $\beta$- and $u$-steps and the linear-time
updates in the $z$-step makes the overall algorithm highly efficient in each of its
components. The dominating concern therefore becomes the number of alternating steps
required to reach convergence, which is primarily influenced by the choice of graph
decomposition. We turn to this question in Section 3 before describing the results of
several experiments designed to evaluate the efficiency of our ADMM algorithm across
a variety of graph types and trail decomposition strategies.
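Algorithm 1 Pseudo-tour graph decomposition algorithm. Yields a minimal set of trails
but does not control trail length variance.

Input: Graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
Output: Set of trails $\mathcal{T}$
 1: Pseudo-edge set $\tilde{\mathcal{E}} \leftarrow \emptyset$
 2: $\tilde{\mathcal{G}} \leftarrow \mathcal{G}$
 3: $\hat{\mathcal{V}} \leftarrow$ OddDegreeVertices($\mathcal{G}$)
 4: while $|\hat{\mathcal{V}}| > 0$
 5:     Choose two unadjacent vertices, $v_i$ and $v_j$, from $\hat{\mathcal{V}}$
 6:     Remove $v_i$, $v_j$ from $\hat{\mathcal{V}}$
 7:     Add pseudo-edge $\tilde{e} = (v_i, v_j)$ to $\tilde{\mathcal{G}}$ and $\tilde{\mathcal{E}}$
 8: Find a tour over the pseudo-graph, $\tilde{t} \leftarrow$ EulerianCircuit($\tilde{\mathcal{G}}$)
 9: Create $\mathcal{T} = \{t_1, t_2, \ldots, t_k\}$ by breaking $\tilde{t}$ at every edge in $\tilde{\mathcal{E}}$
10: return $\mathcal{T}$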


3 Trail decompositions 


3.1 Choosing trails: two approaches 


Two approaches to proving the first half of Theorem 1 are given in [West 2001], and each
corresponds to a variation of our algorithm. The first approach is to create $k$
"pseudo-edges" connecting the $2k$ odd-degree vertices, and then to find an Eulerian tour on the
surgically altered graph. To decompose the graph into trails, we then walk along the
tour (which by construction enumerates every edge in the original graph exactly once).
Every time a pseudo-edge is encountered, we mark the start of a new trail. See
Algorithm 1 for a complete description of the approach. The second approach is to iteratively
choose a pair of odd-degree vertices and remove a shortest path connecting them. Any
component that is disconnected from the graph then has an Eulerian tour and can be
appended onto the trail at the point of disconnection.
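The first approach can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes a connected graph with integer vertex labels, pairs the odd-degree vertices in sorted order rather than choosing unadjacent pairs (edge ids keep pseudo-edges distinct from real edges, so parallel edges are harmless here), and uses Hierholzer's algorithm for the Eulerian circuit. The name `pseudo_tour_trails` is hypothetical.

```python
from collections import defaultdict


def pseudo_tour_trails(edges):
    """Partition the edges of a connected graph into trails (vertex lists)."""
    all_edges = list(edges)
    n_real = len(all_edges)          # edge ids >= n_real denote pseudo-edges
    adj = defaultdict(list)          # vertex -> list of (neighbor, edge id)
    for eid, (r, s) in enumerate(all_edges):
        adj[r].append((s, eid))
        adj[s].append((r, eid))

    # Join odd-degree vertices pairwise so that every degree becomes even.
    odd = sorted(v for v in adj if len(adj[v]) % 2 == 1)
    for a, b in zip(odd[0::2], odd[1::2]):
        eid = len(all_edges)
        all_edges.append((a, b))
        adj[a].append((b, eid))
        adj[b].append((a, eid))

    # Hierholzer's algorithm; each stack entry is (vertex, edge id used to reach it).
    used = [False] * len(all_edges)
    stack = [(all_edges[0][0], None)]
    circuit = []
    while stack:
        v, _ = stack[-1]
        while adj[v] and used[adj[v][-1][1]]:
            adj[v].pop()             # discard edges already traversed
        if adj[v]:
            w, eid = adj[v].pop()
            used[eid] = True
            stack.append((w, eid))
        else:
            circuit.append(stack.pop())
    circuit.reverse()

    # Consecutive circuit entries give steps (from, to, edge id).
    steps = [(circuit[i][0], circuit[i + 1][0], circuit[i + 1][1])
             for i in range(len(circuit) - 1)]
    # Rotate the closed tour so that it ends on a pseudo-edge, if one exists.
    for i, (_, _, eid) in enumerate(steps):
        if eid >= n_real:
            steps = steps[i + 1:] + steps[:i + 1]
            break

    # Cut the tour at every pseudo-edge.
    trails, trail = [], []
    for a, b, eid in steps:
        if eid >= n_real:
            if trail:
                trails.append(trail)
            trail = []
        else:
            if not trail:
                trail = [a]
            trail.append(b)
    if trail:
        trails.append(trail)
    return trails
```

On a path graph with edges `[(0, 1), (1, 2)]` the sketch returns a single trail covering both edges, and on a graph with no odd-degree vertices it returns one closed trail, the Eulerian tour itself.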

Both of these approaches are generic graph decomposition algorithms and are not
tailored for our specific application. Algorithm 1 is appealing from a computational
perspective, as finding an Eulerian tour can be done in linear time [Fleischner and
Fleischner 1990]. The latter approach is computationally more intensive during the
decomposition phase, but enables control over how to choose each odd-degree pair and
thus allows us greater control over the construction of our trails. In the case where the
graph structure remains fixed across multiple problem instances, it may be worthwhile
to spend a little extra time in the decomposition phase if it yields trails that will reduce
the average or worst case time to convergence. This leaves an open question: what
makes a good trail set?

Given that our dual updates are quick and exact, one intuitive goal is to find well- 
balanced trail sets where all of the trails are fairly long and can maximally leverage 
the linear-time solver. Finding the trail decomposition that maximizes this metric is
in general a non-trivial problem, since removing an edge will alter the structure of the 
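Algorithm 2 Median trails heuristic algorithm. Tries to find a set of trails with a high
mean-to-variance length ratio.

Input: Graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$
Output: Set of trails $\mathcal{T}$
 1: Subgraphs set $\mathcal{S} \leftarrow \{\mathcal{G}\}$
 2: $\mathcal{T} \leftarrow \emptyset$
 3: while $|\mathcal{S}| > 0$
 4:     for each subgraph $s = (\nu, \epsilon)$ in $\mathcal{S}$
 5:         $\hat{\nu} \leftarrow$ OddDegreeVertices($s$)
 6:         if $|\hat{\nu}| = 0$
 7:             $t \leftarrow$ EulerianCircuit($s$)
 8:         else if $|\hat{\nu}| = 2$
 9:             $t \leftarrow$ EulerianTour($\hat{\nu}_1$, $\hat{\nu}_2$)
10:         else
11:             $\mathcal{P} \leftarrow \emptyset$
12:             for each pair ($\nu_i$, $\nu_j$) in PairCombinations($\hat{\nu}$)
13:                 $\mathcal{P} \leftarrow \mathcal{P} \cup$ ShortestPath($\epsilon$, $\nu_i$, $\nu_j$)
14:             $t \leftarrow$ MedianLengthPath($\mathcal{P}$)
15:         $\mathcal{T} \leftarrow \mathcal{T} \cup t$
16:         CheckSubgraph($\mathcal{S}$, $s$)
17: return $\mathcal{T}$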


subgraph and may change the structure of the trails added in following iterations.
Moreover, exhaustively exploring the effects of adding each edge is unlikely to be tractable
even for small graphs. We therefore choose a heuristic that balances greedy trail
selection with potential downstream effects. For instance, adding the maximum-length
trail at each iteration would add a single long trail to the resulting trail set, but could
potentially leave many short trails remaining. Conversely, adding the minimum-length
trail is clearly not a good choice as it will generate many short trails directly. We found
that choosing the median-length trail tends to perform well empirically across a range
of graph types, and this is the heuristic used in Algorithm 2.¹

3.2 Grid-graph example 

It may not be immediately clear that the trail decomposition strategy makes any differ¬ 
ence to the optimization algorithm. After all, both strategies produce trails consisting 
of the same number of total edges. Even in extreme cases when it seems obvious that it 
would likely have an effect (e.g. a single long trail vs. a set of single edges), it is hard to 
know the extent to which the graph decomposition can impact algorithm performance. 

¹ For computational purposes, Algorithm 2 handles the graph split merging edge-case by simply creating
two separate trails. The result is a small, but not necessarily minimal, set of trails, unlike Algorithm 1 which
yields a provably minimal set of trails.
To help build this intuition, we present a motivating example before turning to the
full benchmarks.

Consider a simple 100 x 100 grid graph example, where each node has an edge to
its immediate neighbors. Figure 1 shows the distribution of trail lengths for the results
of both Algorithms 1 (Pseudo-tour) and 2 (Medians). Clearly the pseudo-tour trails are
much more dispersed, producing a large number of short (< 50 nodes) and long (> 150
nodes) trails. Conversely, the medians algorithm generates mostly moderate trails of
length approximately 100. For a grid graph, it is also straightforward to hand-engineer
a very well-balanced set of trails: simply use each of the 100 rows and 100 columns as
trails. This produces a small set of trails each of length 100; the row and column trails
are noted via the dashed red line in the figure.
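The hand-engineered row and column decomposition is simple enough to state in code. The sketch below assumes grid nodes are numbered in row-major order, which is a labelling choice made here for illustration; the function names are hypothetical.

```python
def grid_row_col_trails(n_rows, n_cols):
    """Return one trail per row and one per column of an n_rows x n_cols grid."""
    def node(r, c):
        return r * n_cols + c   # row-major node index (an assumed labelling)

    rows = [[node(r, c) for c in range(n_cols)] for r in range(n_rows)]
    cols = [[node(r, c) for r in range(n_rows)] for c in range(n_cols)]
    return rows + cols


def trail_edges(trail):
    """Undirected edges traversed by a trail given as a vertex list."""
    return [tuple(sorted(e)) for e in zip(trail, trail[1:])]


trails = grid_row_col_trails(100, 100)
covered = [e for t in trails for e in trail_edges(t)]
```

For the 100 x 100 grid this gives 200 trails of 100 nodes each, and `covered` lists each of the 19800 grid edges exactly once.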


[Figure 1(a): histograms of trail length (x-axis: Trail length, 0 to >400) for the Pseudo-tour and Medians decompositions.]

(b)

Trails         Steps
Rows+Cols       407.100 (+/- 13.079)
Pseudo-tour    1506.820 (+/- 42.261)
Medians        1073.690 (+/- 31.062)

Figure 1: (Left) The distribution of trail sizes for the two graph decomposition
algorithms on a 100 x 100 grid graph. The Pseudo-tour algorithm produces a large number
of short and long trails, whereas the Medians algorithm generates more moderately
sized trails. The red line marks the size of every trail if using a simple row and column
trail set (high mean:variance ratio). (Right) The mean number of steps to convergence
across 100 independent trials on the 100 x 100 grid graph, with standard error in
parentheses.


The results of 100 independent trials using each of the trail sets are displayed in the
table in Figure 1. The difference in performance is striking: the row and column trails
converge nearly four times faster than the pseudo-tour trails.² The medians
decomposition also performs substantially worse than the row and column trails, but nearly 1/3
better than the pseudo-tour trails. Clearly, the trail choice can have a dramatic impact

² In general, step counts are directly translatable into wall-clock time for a serial implementation of our
algorithm. We choose to show steps when comparing trails as they are not affected by externalities like
context switching, making them slightly more reliable.
on convergence time, especially for large-scale problems where a 30-60% speedup could
be a substantial improvement. 

Despite the performance differences, it is important to remember that the two
decomposition algorithms have different goals. The pseudo-tour approach of Algorithm 1
aims to quickly find a minimal set of trails, which maximizes the mean trail length.
The downside is the risk of high variance in the trail lengths and potentially slower
convergence. Conversely, the medians heuristic in Algorithm 2 attempts to select trails that
will be well-balanced, but comes with a high preprocessing cost since it must consider
all $2(k - t) \times 2(k - t - 1)$ paths at every step $t$ and requires $k$ steps. While in practice
we can randomly sample a subset of candidate pairs for large graphs, the difference in
preprocessing times between the two algorithms can still be massive: for the DNA
example in Section 4, Algorithm 2 takes hours while Algorithm 1 finishes in less than ten
seconds.


4 Experiments 


4.1 Alternative approaches 

To evaluate the effectiveness of our algorithm, we compare it against several other
approaches to solving (1).³

In one method used in the recent literature [Wahlberg et al. 2012; Tansey et al. 2014],
two sets of slack variables are introduced in order to reformulate the problem such that
the dominant cost is in repeatedly solving the system $(I + L)x = b$, where $L$ is the graph
Laplacian and only $b$ changes at every iteration. Modern algebraic multigrid (AMG)
methods have become highly efficient at solving such systems for certain forms of $L$, and
in the following benchmarks we compare against an implementation of this approach
using a smoothed aggregation AMG approach [Bell et al. 2011] with conjugate gradient
acceleration. In all reported timings, we do not include the time to calculate the
multigrid hierarchy, though we note this can add significant overhead and may even take
longer than solving the subsequent linear system.

In the spatial case, such as the grid graph from Section 3, a highly efficient parametric
max-flow method is available [Chambolle and Darbon 2009]. It is also straightforward
in many spatial cases to hand-engineer trails, as in our row and column trails from the
grid graph example. Indeed, the row and column trails are equivalent to the "proximal
stacking" method of [Barbero and Sra 2014], which uses a proximal algorithm with
the rows and columns split and updated with an efficient 1D solver for anisotropic total
variation denoising on geometric graphs. In our benchmarks, we consider both the
max-flow and row and column trails, to give an idea of performance improvement when
special structure can be leveraged in the graph.

We also test different trail decomposition strategies. To demonstrate the benefits of 
our trail decomposition strategy, we compare against a naive decomposition that simply


³ In our effort to be thorough, we also attempted to evaluate against smoothing proximal gradient [Chen
et al. 2012], but were unable to achieve adequate convergence with this method (see Appendix A for
details).
treats each edge as a separate trail. We also demonstrate the advantage of our median
heuristic by testing it against a similar heuristic that randomly adds trails. Finally, we
show how performance of the row and column trails degrades as their length decreases,
to illustrate the benefit of longer trails. For all trail decomposition strategies, we solve
the 1D fused lasso updates using the dynamic programming solution of [Johnson 2013]
implemented in the glmgen package.⁴


4.2 Performance criteria 

For all experiments, we report timings averaged across 30 independent trials; for trail
decomposition strategy comparisons, we report the number of steps rather than
wall-clock time. In each trial, we randomly create four blobs of different mean values, with
each blob being approximately 5% of the nodes in the graph. The algorithms are then
run until $10^{-4}$ precision and we use a varying penalty acceleration heuristic [Boyd et al.
2011], though performance results are not meaningfully different for different precision
levels or fixed penalty parameters.
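For reference, the residual-balancing scheme described by Boyd et al. [2011] adjusts the penalty parameter according to the relative sizes of the primal and dual residuals. The sketch below illustrates that general scheme; the constants `mu = 10` and `tau = 2` are the typical values suggested in that reference and are an assumption here, not settings reported for these experiments. The function name `update_penalty` is hypothetical.

```python
def update_penalty(alpha, u, primal_resid_norm, dual_resid_norm, mu=10.0, tau=2.0):
    """Residual balancing for scaled-form ADMM.

    Returns the new penalty parameter and the rescaled scaled-dual vector.
    Because u is the dual variable divided by alpha, u must be divided by
    tau when alpha is multiplied by tau, and vice versa.
    """
    if primal_resid_norm > mu * dual_resid_norm:
        return alpha * tau, [ui / tau for ui in u]
    if dual_resid_norm > mu * primal_resid_norm:
        return alpha / tau, [ui * tau for ui in u]
    return alpha, list(u)


# The primal residual dominates, so the penalty is increased.
alpha_new, u_new = update_penalty(1.0, [2.0, -4.0],
                                  primal_resid_norm=50.0, dual_resid_norm=1.0)
```

In this call the penalty doubles to 2.0 and the scaled dual becomes `[1.0, -2.0]`.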

With the exception of the spatially-structured methods, all algorithms are compared
on two synthetic and two "real-world" graphs. For the synthetic examples, we create
randomly-connected graphs with approximately 99.8% sparsity, and square grids with
adjacency defined by each node's immediate neighbor; performance is then tracked as
the number of nodes in the graphs increases. However, randomly generated networks
often produce graph forms that are unlikely to be seen in the wild; conversely, square
grids are likely too uniformly structured to be common in many scientific applications,
with computer vision being a notable exception. Therefore, to better understand
performance in practice, we also conduct benchmarks on two real-world datasets: a DNA
electrophoresis model and an acoustic vibrobox simulation, both taken from The
University of Florida sparse matrix collection [Davis and Hu 2011]. Figure 2 shows the
adjacency structure of these matrices. We note that while both exhibit some degree of
structure, the DNA example presents a much higher degree of regularity and is more
similar to a grid graph than the vibrobox example.


4.3 Trail selection results 

The results of our graph decomposition benchmarks are shown in Table 1. In each
example, the medians strategy performs as well as or better than the other three methods,
up to standard error.
Additionally, the naive edge-wise decomposition strategy performs substantially worse
overall than any of the trail strategies. For the two graphs with a high degree of
structure, namely the grid graph and DNA model, our medians strategy significantly
outperforms all of the other methods. When the graphs are less structured, the benefits of
clever decomposition fade. We also note that while the random trails perform better on
the grid than the pseudo-tour, it is important to remember that the former is an iterative
approach that requires much higher preprocessing time, as noted in Section 3.

⁴ https://github.com/statsmaths/glmgen
[Figure 2 panels: (a) Vibrobox, (b) DNA Electrophoresis.]

Figure 2: The connectivity structure of the two real-world graphs; black cells denote
connected nodes. (a) An acoustic vibrobox simulation with 12K nodes. (b) A DNA
electrophoresis model with 39K nodes.


To help shed light on the cause of the varying performance, Figure 3 shows
the distributions of trail lengths for the pseudo-tour and medians strategies across the four
benchmark graphs. On random graphs (3a and 3b) the distributions look very similar,
with little-to-no balancing aid from the median selection heuristic. When the graph is
highly structured, as in the grid graph (3c and 3d), the median selection helps to balance
out the overall trail set and pushes the overall size of the median trail up. In the vibrobox
example (3e and 3f) the trails are better balanced by the medians strategy, but at the cost
of a much shorter median trail length, likely negating any potential gains from the better
balanced trails. Finally, in the DNA example (3g and 3h) the medians strategy pays
slightly in terms of typical trail length, but the drastic improvement in balance leads to
a reasonable performance improvement as noted in Table 1.

This line of reasoning is further supported by Figure 4, which shows the performance
of splitting the row and column hand-engineered trails on a 256 x 256 grid graph. For
this experiment, we recursively break each trail in half and run it to convergence on
the same graph. As the figure shows, the length of the trails is directly related to the
convergence rate of the algorithm.


4.4 Results against other methods 

Figure 5 shows the results of a comparison of the two trails methods versus the
alternative implementations on random (5a) and grid (5b) graphs. In both cases, we see that the
AMG approach takes substantially longer than either trail strategy (note this is a log-log
plot). On random graphs (Figure 5a), there is little difference in performance between
the two trail strategies, possibly due to the lack of underlying structural regularity in
Graph      Nodes   Pseudo-tour   Medians      Random       Edges
Random     50625   178 (6)       186 (6)      183 (9)      222 (9)
Grid       50625   11502 (856)   6254 (494)   9005 (535)   15228 (723)
Vibrobox   12328   111 (0)       111 (0)      111 (1)      227 (1)
DNA        39082   379 (31)      275 (29)     359 (31)     365 (32)

Table 1: Benchmark results for different trail decomposition strategies. Values reported 
are the mean number of steps to convergence, with standard errors in parentheses. 


the graph. On the other hand, on grid graphs the medians strategy converges
approximately 25% faster than the pseudo-tour trails. We also note that the simple rows and
columns hand-engineered strategy is very competitive with the specialized max-flow
method, which only works on the limited class of spatial graphs and with the specific
squared loss function; our method has none of these constraints.


5 Discussion 


We have presented a novel approach to solving the fused lasso on a generic graph by
the use of a graph decomposition strategy. Experiments demonstrate that our algorithm
performs well empirically, even against specialized methods that handle only a limited
class of graphs and loss functions. The flexibility and speed of our algorithm make it
immediately useful as a drop-in replacement for GFL solvers currently being used in
scientific applications (e.g., [Tansey et al. 2014]).

There remain two outstanding and related questions regarding our trail
decomposition approach: 1) what are the precise properties of a graph decomposition that lead to
rapid convergence of our ADMM algorithm, and 2) how can we (efficiently) decompose
a graph such that the resulting trails have these properties? While we have presented
some intuitive concepts around the idea of a balanced set of long trails, we are far from
answering these questions with certainty. It may be possible to leverage recent work
on ADMM convergence rate analysis [Nishihara et al. 2015] and in particular work on
network models [Fougner and Boyd 2015] to this end. At present, our approaches are
purely heuristics driven by empirical success and we plan to explore ways to decompose
graphs in a more principled manner in future work.
A Gradient Smoothing Alternative


An alternative approach to solving (1) is to iteratively solve a smooth approximation
[Nesterov 2005] of the non-smooth penalty term. This approach has been adapted
to problems similar to the GFL; for instance, the "graph-guided fused lasso" problem
was solved via Smoothing Proximal Gradient (SPG) [Chen et al. 2012]. Note that SPG is
designed to solve problems of the form $f(x) + g(x) + h(x)$, where $f(x)$ is a smooth convex
function, $g(x)$ is convex, and $h(x)$ is the $L_1$ penalty on $x$. The $h(x)$ term is not present in
(1) and the "graph-guided fused lasso" could alternatively be named the "sparse GFL",
in keeping with the lasso nomenclature (see, for example, the sparse group lasso
[Friedman et al. 2010]). Nevertheless, if we set the $h(x)$ penalty weight to zero, we recover the
following algorithm for solving the GFL:


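Algorithm 3 Gradient smoothing algorithm for the GFL.

Input: Observations $y$, sparse $|\mathcal{E}| \times |\mathcal{V}|$ edge differences matrix $D$, error tolerance $\epsilon$
Output: Graph-smoothed $\beta$ values
1: $\mu \leftarrow \epsilon / |\mathcal{E}|$
2: while $\beta$ not converged
3:     $\alpha^{k+1} \leftarrow S(D\beta^k / \mu)$
4:     $\nabla(\ell + \tilde{g}) = \beta^k - y + D^\top \alpha^{k+1}$
5:     $\beta^{k+1} \leftarrow \beta^k - \eta \nabla(\ell + \tilde{g})$
6: return $\beta$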


In the above algorithm, $S$ is the elementwise truncation operator on the interval
$[-1, 1]$, $\tilde{g}$ is the smooth approximation to $g$, and $\eta$ is the typical gradient descent stepsize
(determined via backtracking line search in our implementation).

While Algorithm 3 converges in the limit [Nesterov 2005], in practice it often stalled
well short of the optimum in our experiments. Figure 6 shows an example of a 100 x 100
grid-fused lasso problem solved to $10^{-6}$ precision using Algorithm 3 and our trail-based
approach. The trail method recovers the true underlying parameters almost exactly,
while the SPG-type method is substantially under-smoothed, particularly with respect
to the background noise in the image.
Figure 3: Trail length distributions across the different benchmark graphs; dashed lines
are median trail lengths.


[Figure 4 plot; x-axis: Trail Length.]

Figure 4: Performance degradation in a 256 x 256 grid graph example as the row and 
column trails shrink in length. 




(a) Random graph (b) Grid graph 

Figure 5: Log-log plot comparing the performance of the two trail decomposition
strategies to the alternative approaches. In both cases, the AMG approach is substantially
slower. On random sparse graphs both trail strategies are effectively equal, whereas on grid
graphs the medians heuristic converges approximately 25% faster.
Figure 6: Comparison of the different methods for solving the GFL problem. The SPG
method recovers a solution that is substantially under-smoothed.